To better understand how prices are set in the sharing economy, a wide range of papers have used hedonic price models to estimate how consumers value Airbnb listings (e.g. Gibbs et al. (2018), Teubner et al. (2017)). In this kind of model, structured attributes of the listing (number of rooms, location, rating, etc.), often together with attributes of the host, are used to identify the sources of consumer utility.
In the following analysis I want to exploit the textual data in the listing descriptions to predict the price of a listing.
Research questions:
Method:
To compare my approach with the conventional methods, I first estimate a model in which I use the structured attributes as exogenous regressors to predict the price of an Airbnb listing. Afterwards, I use textual features of the same listings to predict the prices and compare the two models.
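As a rough illustration of the baseline, the sketch below fits a hedonic regression on simulated data and computes an out-of-sample error that a text-based model could later be compared against. The column names mirror those in the data set, but the simulated values, the log-price specification, the choice of regressors and the 80/20 split are all assumptions made for the example, not the model estimated later.

```r
# Simulated stand-in for the listings data (all values are invented)
set.seed(42)
n <- 500
toy <- data.frame(
  accommodates = sample(1:6, n, replace = TRUE),
  bedrooms = sample(1:3, n, replace = TRUE),
  overall_satisfaction = sample(seq(3, 5, 0.5), n, replace = TRUE)
)
toy$price <- exp(3 + 0.15 * toy$accommodates + 0.10 * toy$bedrooms +
                   0.05 * toy$overall_satisfaction + rnorm(n, sd = 0.3))

# Hold out 20% of the listings to evaluate predictions
test_idx <- sample(n, size = 0.2 * n)
train <- toy[-test_idx, ]
test <- toy[test_idx, ]

# Hedonic regression: log price on structured attributes
fit <- lm(log(price) ~ accommodates + bedrooms + overall_satisfaction,
          data = train)

# Out-of-sample RMSE on the log scale
rmse <- sqrt(mean((log(test$price) - predict(fit, newdata = test))^2))
rmse
```

Evaluating both models on the same held-out listings keeps the comparison fair, since in-sample fit would favour whichever model has more features.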
The project is divided into three parts. In this section I describe the data set and how I prepare it for analysis. In the second part I estimate a linear model with the conventional attributes and in the third part I use text data for the same listings.
I use a unique dataset that contains information on 47,006 Airbnb listings from seven major German cities, namely Berlin, Munich, Hamburg, Cologne, Dresden, Stuttgart and Frankfurt am Main. Listings were gathered directly from Airbnb’s website in September 2017 using a custom web scraper. In this way I have obtained all publicly available information for a listing, including but not limited to prices, accommodation features, reviews and host details.
head(rooms)
## # A tibble: 6 x 62
## room_id host_id room_type country city neighborhood address price
## <int> <int> <chr> <chr> <chr> <chr> <chr> <int>
## 1 19117409 1.34e⁸ Entire ho… Deutsc… Hamb… <NA> Othmarsch… 129
## 2 5728058 3.34e⁵ Entire ho… Deutsc… Hamb… <NA> Neustadt,… 116
## 3 19954984 1.41e⁸ Entire ho… Deutsc… Münc… <NA> Schwabing… 91
## 4 9918551 5.10e⁷ Entire ho… Deutsc… Schö… <NA> Schönefeld 43
## 5 13836114 8.16e⁷ Entire ho… Deutsc… Hamb… <NA> Eimsbütte… 61
## 6 20355318 8.02e⁷ Entire ho… Deutsc… Köln <NA> Köln 49
## # ... with 54 more variables: nightly_price <int>, reviews <int>,
## # accommodates <int>, bathrooms <int>, bedrooms <int>, bed_type <chr>,
## # minstay <int>, last_modified <dttm>, latitude <dbl>, longitude <dbl>,
## # survey_id <int>, location <chr>, coworker_hosted <chr>,
## # extra_host_languages <chr>, name <chr>, property_type <chr>,
## # currency <chr>, rate_type <chr>, overall_satisfaction <chr>,
## # cleanliness_satisfaction <int>, communication_satisfaction <int>,
## # location_satisfaction <int>, accuracy_satisfaction <int>,
## # checkin_satisfaction <int>, value_satisfaction <chr>, amenities <chr>,
## # cancel_policy <chr>, instant_book <chr>, response_time <chr>,
## # response_rate <dbl>, friend_count <int>, wishlist_count <int>,
## # pic_count <chr>, superhost <chr>, description_language <chr>,
## # hostname <chr>, rule_children <chr>, rule_infants <chr>,
## # rule_pets <chr>, rule_smoking <chr>, rule_events <chr>,
## # hostprofilepic <chr>, cleaning_fee <chr>, security_deposit <chr>,
## # last_review <dttm>, positive_reviews <dttm>, negative_reviews <date>,
## # last_cal_update <chr>, member_since <chr>, host_verified <chr>,
## # deleted <chr>, filled <chr>, description <chr>, base_price <chr>
# Convert strings to numeric and drop listings without a rating
rooms <- rooms %>%
  mutate(overall_satisfaction = as.numeric(overall_satisfaction),
         pic_count = as.numeric(pic_count)) %>%
  filter(!is.na(overall_satisfaction))

Keep only listings from the following cities: Hamburg, München, Berlin, Köln, Frankfurt, Dresden, Stuttgart
## create clean-up function: any city value that contains x is replaced by x
create_city <- function(x, city){
  city_clean <- ifelse(grepl(x, city), x, city)
  return(city_clean)
}

city_list <- c("Hamburg","München","Berlin","Frankfurt","Köln","Stuttgart","Dresden")
for(i in city_list){
  rooms$city <- create_city(i, rooms$city)
}
rooms %>%
  filter(city %in% city_list) -> rooms
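To make the effect of the clean-up step concrete, here is a small sketch on invented city strings of the kind a scraper might return. The raw values and the shortened city list are hypothetical; only the helper itself is taken from the code above.

```r
# Same helper as above: a value containing the pattern is replaced by the pattern
create_city <- function(x, city) {
  ifelse(grepl(x, city), x, city)
}

# Invented raw values for illustration
raw <- c("Hamburg-Altona", "Berlin Mitte", "Dresden", "Potsdam", NA)
city_list <- c("Hamburg", "Berlin", "Dresden")

clean <- raw
for (i in city_list) {
  clean <- create_city(i, clean)
}
clean

# Rows that would survive the subsequent filter
clean %in% city_list
```

Because `grepl()` does a substring match, a value such as "Frankfurt (Oder)" would also collapse to "Frankfurt", so it is worth inspecting the distinct raw values before relying on this step.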
rooms %>%
  group_by(city) %>%
  tally() %>%
  ggplot(aes(reorder(city, n, desc), n)) +
  geom_col(fill = col[3], alpha = 0.8) +
  labs(x="", y="", title="Count")

rooms %>%
  group_by(property_type) %>%
  tally() %>%
  ggplot(aes(reorder(property_type, n), n)) +
  geom_col(fill = col[3], alpha = 0.8) +
  labs(x="", y="", title="Property Types") +
  coord_flip()

To keep things simple, I keep only listings of property type “Wohnung” (apartment).
rooms %>%
  filter(property_type == "Wohnung") -> rooms

rooms %>%
  ggplot(aes(room_type)) +
  geom_bar(fill = col[3], alpha = 0.8) +
  labs(x="", y="")

rooms %>%
  ggplot(aes(city, price)) +
  geom_boxplot(outlier.size = 0)

Apparently, there are some outliers. After checking the respective listings, I decided to exclude them.
rooms %>%
  filter(price < 1500) -> rooms

rooms$price.cut <- cut(rooms$price, c(seq(0,500,1), Inf))
rooms %>%
  ggplot(aes(as.numeric(price.cut), factor(city))) +
  geom_density_ridges(scale = 5,
                      fill = col[3], alpha = 0.7,
                      color = "white") +
  theme_ridges() +
  scale_x_continuous(expand = c(0, 0), labels = c(seq(0,400,100),">500")) +
  labs(y="", x="Price")

rooms %>%
  ggplot(aes(overall_satisfaction, factor(room_type))) +
  geom_density_ridges(scale = 5,
                      fill = col[3], alpha = 0.7,
                      color = "white") +
  scale_x_continuous(expand = c(0, 0)) +
  labs(y="", x="Rating")

Next, I exclude listings with fewer than three reviews, as it can be assumed that these listings have rarely or never been booked.
rooms %>%
  filter(reviews >= 3) -> rooms

rooms$reviews.cut <- cut(rooms$reviews, c(seq(0,50,1), Inf))
rooms %>%
  ggplot(aes(as.numeric(reviews.cut), factor(city))) +
  geom_density_ridges(scale = 5,
                      fill = col[3], alpha = 0.7,
                      color = "white") +
  scale_y_discrete(expand = c(0,0)) +
  scale_x_continuous(expand = c(0,0),
                     breaks = c(seq(0,50,10)),
                     labels = c(seq(0,40,10),">50")) +
  labs(y="", x="Number of Reviews")

df <- rooms %>%
  select(room_id, name,
         description, city, price, overall_satisfaction,
         room_type, bed_type, pic_count,
         reviews, accommodates, bedrooms, minstay,
         latitude, longitude) %>%
  mutate(fulltext = paste(name, description, sep=" "))

Turning to the text data, let's first have a quick look at three random descriptions:
rooms %>%
  sample_n(3) %>%
  select(description) %>%
  knitr::kable(align = "l")

| description |
|---|
| Mein Zimmer in einer WG ist vom 18.04-22.04 frei. Darum vermiete ich dieses für den Zeitraum unter. Ich wohne mit 3 sehr freundlichen Mitbewohnern (22-25). Sie sprechen sprechen alle fließend Deutsch und Englisch. |
| Hallo! Willkommen in meiner sonnigen, ruhigen und gemütlichen 2-Zimmer Wohnung in Wilmersdorf! Wer wie Zuhause wohnen will, ist hier richtig! |
| Beautiful modern design apartment in the heart of Munich - fully equipped kitchen - spacious living room with balcony facing the courtyard. Perfect location - 2 minutes - to the famous Maximilianstrasse - Underground parking available upon request. Schöne moderne Design- Wohnung im Herzen von München - voll ausgestattete Küche - großzügiges Wohnzimmer mit Balkon zum Innenhof. Perfekte Lage in der Nähe - 2min. - zur berühmten Maximilianstrasse - Tiefgaragenstellplatz verfügbar. |
In which languages are the descriptions written?
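The `language` column comes from the prepared file loaded below; the classification step itself is not shown here. Purely as an illustration of the idea, a very crude guess can be made by counting common function words per language. The word lists and the `guess_language()` helper are hypothetical and far simpler than a real language detector.

```r
# Tiny hand-picked function-word lists (an assumption for this sketch)
markers <- list(
  german  = c("der", "die", "das", "und", "ist", "in", "mit", "ein", "eine", "zu"),
  english = c("the", "and", "is", "in", "with", "a", "of", "to", "this", "you")
)

# Return the language whose marker words occur most often, or NA if none occur
guess_language <- function(text) {
  words <- strsplit(tolower(text), "[^[:alpha:]]+")[[1]]
  hits <- sapply(markers, function(m) sum(words %in% m))
  if (all(hits == 0)) return(NA_character_)
  names(hits)[which.max(hits)]
}

guess_language("Helle Wohnung mit Balkon, die Lage ist ruhig")
guess_language("Bright room in the heart of the city with a balcony")
```

A listing that mixes both languages, like the Munich example above, gets whichever label has more marker words, which is one reason to check the classification on a sample.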
load(file = "../output/prep1.Rda")

df %>%
  group_by(language) %>%
  tally() %>%
  ggplot(aes(reorder(language, n), n)) +
  geom_col(fill = col[3], alpha = 0.7) +
  coord_flip() +
  labs(x="", y="")

Check a sample of listings to see whether the classification is valid:
df %>%
  sample_n(5) %>%
  select(fulltext, language) %>%
  knitr::kable()

| fulltext | language |
|---|---|
| Helle Wohnung gepflegt und ruhig Schöne, neu eingerichtete helle 1 Zimmer Wohnung mit komplett neuem Interieur und Elektrogeräten. Das Badezimmer ist weiss gefliest, mit Fenster. Innerhalb weniger Gehminuten erreichst Du alle Restaurants und Cafes in Friedrichshain. | german |
| Bright room in 1870s-building, nice neighbourhood In einem Altbau von 1890 seid ihr zu Gast in einem großen, hellen Zimmer. Schön eingerichtet für jede und für jeden, der Farben, Bücher, Musik und Katzen mag. | german |
| Zentral, ruhig, hell, viel Platz In der 1. Etage eines Einfamilienhauses befindet sich eine eigene Wohnung mit 2 Schlafräumen. Ab 3 Personen stehen 3 Schlafräume zur Verfügung (20, 22 bzw. 25 qm) . Dazu gibt es eine Küche , in der gekocht und gegessen werden kann , ein Bad mit Dusche und Wanne und eine Terrasse . Das Haus ist von einem Garten umgeben und befindet sich im grünen Stadtteil Dresden- Strehlen . Um das Stadtzentrum von Dresden zu erreichen , stehen eine Bus- und 2 Straßenbahnlinien zur Verfügung (Haltestelle in 3 Minuten und Fahrtzeit 10-15 Minuten). Der nahe Bahnhof von Dresden - Strehlen lädt zu Fahrten nach Meißen und in die Sächs. Schweiz ein. Etliche Gaststätten , 1 Supermarkt (Konsum) und eine gute Bäckerei sorgen für den täglichen Bedarf. Ich bewohne das Erdgeschoß und vermiete die 1. Etage beinahe ganzjährig als Ferienwohnung. Ich bin Urdresdner und zeige meinen Gästen gern bei Bedarf die Stadt und ihre schöne Umgebung. Seit Februar 2017 erhebt die Stadt Dresden eine Beherbergungssteuer, in meinem Fall 1-2 Euro pro Person und Nacht. Das Geld ist bei Anreise gegen Quittung an mich bar zu zahlen. Ich gebe es dann an die Stadt weiter. Zeit für An- und Abreise kann bei Bedarf individuell geregelt werden. Während des Aufenthalts der Gäste stehe ich gern als Ansprechpartner zur Verfügung. | german |
| Loft with Huge Terrace This apartment is simply the best: Top location, great design, nice kitchen (all appliances incl. dish washer & washing machine), wonderful king size bed, bathroom with tub. The terrace offers a lovely view into Berlins first bycicle-street. | english |
| 1 bedroom Prenzlauer Berg Cosy and bright City Apartment in the hart of Prenzlauer Berg, closed to Schönhauser Allee. Spacios 50m2 with 2 rooms to chare , large kittchen. There is a 2nd bed/couch. You dont need a car, everything you find in walking distance, Bars & Restaurants, train Station Schönhauser Allee, Cinemax, Shopping Center. | english |
OK, looks good. Let's keep only listings with German and English descriptions.
df %>%
  filter(language %in% c("german","english")) -> df

ggplot(df, aes(x=factor(city))) +
  geom_bar(aes(fill = language),
           alpha = 0.8) +
  labs(x="", y="", fill="")

It is not surprising that Berlin seems to be the most international city, measured by the listings that have their description in English. But I am a little disappointed with Hamburg…
How long are the descriptions on average?
df$text_length <- sapply(gregexpr("\\S+", df$fulltext), length)

df$text_length.cut <- cut(df$text_length, c(seq(0,150,1),Inf))
df %>%
  ggplot(aes(as.numeric(text_length.cut), factor(city))) +
  geom_density_ridges(aes(fill = language),
                      color = "white", alpha = 0.8) +
  scale_x_continuous(expand = c(0,0),
                     labels = c(seq(0,100,50),">150")) +
  labs(y = "", x = "Word Count", fill= "") +
  theme()

Surprisingly, the English texts are longer.
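The word count in the `text_length` step above treats every run of non-whitespace characters as a word. The sketch below shows the same idea on a few strings; the `count_words()` wrapper is a hypothetical helper, added because `gregexpr()` returns -1 when nothing matches, so taking `length()` alone reports one word for an empty string.

```r
# Count runs of non-whitespace characters
count_words <- function(x) {
  m <- gregexpr("\\S+", x)
  # A failed match is reported as -1, so count only positive match positions
  sapply(m, function(hit) sum(hit > 0))
}

count_words(c("Loft with Huge Terrace", "Zentral, ruhig, hell", ""))
```

Since `fulltext` is built by pasting name and description with a space, it is rarely empty, so the difference should hardly matter for this data set.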
Next, I have to pre-process the text data to be able to include it in my model. Text data is inherently high-dimensional, so to reduce this dimensionality I apply the following steps: replace punctuation, control characters and digits with spaces, trim leading and trailing whitespace, convert everything to lower case, remove English and German stopwords, and split the text into word tokens, dropping tokens of one character or of 30 or more characters.
df$text_cleaned <- gsub("[[:punct:]]", " ", df$fulltext)
df$text_cleaned <- gsub("[[:cntrl:]]", " ", df$text_cleaned)
df$text_cleaned <- gsub("[[:digit:]]", " ", df$text_cleaned)
# trim leading and trailing whitespace (replace with the empty string, not a space)
df$text_cleaned <- gsub("^[[:space:]]+", "", df$text_cleaned)
df$text_cleaned <- gsub("[[:space:]]+$", "", df$text_cleaned)
df$text_cleaned <- tolower(df$text_cleaned)

df$text_cleaned <- removeWords(df$text_cleaned, stopwords("english"))
df$text_cleaned <- removeWords(df$text_cleaned, stopwords("german"))

token.df <- df %>%
  tidytext::unnest_tokens(word, text_cleaned) %>%
  filter(nchar(word) > 1) %>%
  filter(nchar(word) < 30)
token.df %>%
  count(word, sort = TRUE) %>%
  ungroup() %>%
  top_n(20, n) %>%
  knitr::kable(align="l")

| word | n |
|---|---|
| wohnung | 12264 |
| apartment | 9732 |
| zimmer | 8800 |
| room | 8529 |
| min | 8365 |
| berlin | 5994 |
| bahn | 5187 |
| restaurants | 4511 |
| minuten | 4289 |
| flat | 4200 |
| küche | 3877 |
| city | 3862 |
| nähe | 3800 |
| unterkunft | 3488 |
| bars | 3228 |
| qm | 3060 |
| direkt | 2992 |
| liegt | 2983 |
| station | 2955 |
| lage | 2916 |
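To see what the cleaning and tokenisation steps above do to a single description, here is a base R sketch on one example string. The shortened stopword list and the `clean_text()` helper are assumptions made for the illustration; the analysis itself uses the full English and German stopword lists.

```r
# Shortened stopword list for illustration only
stop_words <- c("the", "in", "of", "and", "with", "is", "a")

clean_text <- function(x) {
  # Replace punctuation, control characters and digits with spaces
  x <- gsub("[[:punct:][:cntrl:][:digit:]]", " ", x)
  x <- tolower(x)
  tokens <- strsplit(x, "[[:space:]]+")[[1]]
  # Same length filter as above: keep tokens of 2 to 29 characters
  tokens <- tokens[nchar(tokens) > 1 & nchar(tokens) < 30]
  tokens[!tokens %in% stop_words]
}

tokens <- clean_text("Loft with Huge Terrace: 50m2, 2 rooms in the heart of Berlin!")
tokens
sort(table(tokens), decreasing = TRUE)
```

Note that removing digits turns "50m2" into a lone "m", which the length filter then drops. This is presumably also how fragments such as "qm" and "min" end up among the most frequent words.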
bigram.df <- df %>%
  tidytext::unnest_tokens(bigram, text_cleaned,
                          token = "ngrams", n = 2)
bigram.df %>%
  count(bigram, sort = TRUE) %>%
  ungroup() %>%
  top_n(20, n) %>%
  knitr::kable(align="l")

| bigram | n |
|---|---|
| u bahn | 2699 |
| s bahn | 1870 |
| zimmer wohnung | 1497 |
| wohnung liegt | 1287 |
| prenzlauer berg | 1083 |
| living room | 1081 |
| city center | 989 |
| walking distance | 982 |
| unterkunft gut | 936 |
| bars restaurants | 891 |
| paare alleinreisende | 848 |
| gut paare | 832 |
| unterkunft nähe | 811 |
| restaurants bars | 786 |
| alleinreisende abenteurer | 771 |
| wohnung befindet | 751 |
| unmittelbarer nähe | 745 |
| unterkunft lieben | 733 |
| st pauli | 689 |
| lieben wegen | 678 |
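Some of these bigrams, such as "unterkunft gut" or "gut paare", are unlikely to appear as written in a description. Because the stopwords are removed before the bigrams are formed, words that were originally separated by a stopword become neighbours. The sketch below shows the effect on one invented token sequence; the `make_bigrams()` helper and the two-word stopword list are hypothetical.

```r
# Build bigrams by pairing each token with its successor
make_bigrams <- function(tokens) {
  if (length(tokens) < 2) return(character(0))
  paste(head(tokens, -1), tail(tokens, -1))
}

tokens <- c("unterkunft", "ist", "gut", "für", "paare")

# Bigrams of the original sequence
make_bigrams(tokens)

# Bigrams after dropping stopwords (illustrative subset of the German list)
stop_words <- c("ist", "für")
make_bigrams(tokens[!tokens %in% stop_words])
```

If the original word order matters for the model, forming the bigrams first and filtering out those that contain a stopword afterwards would be an alternative.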
corp <- corpus(df$text_cleaned)
docvars(corp) <- df$city # attach the city labels to the corpus
col <- RColorBrewer::brewer.pal(10, "BrBG")

c.plot <- corpus_subset(corp, docvar1=="Berlin")
c.plot <- dfm(c.plot, tolower = TRUE, remove_numbers = TRUE, remove=stopwords("SMART"))
textplot_wordcloud(c.plot, min.freq = 250, color = col)

c.plot <- corpus_subset(corp, docvar1=="Hamburg")
c.plot <- dfm(c.plot, tolower = TRUE, remove_numbers = TRUE, remove=stopwords("SMART"))
textplot_wordcloud(c.plot, min.freq = 200, color = col)

c.plot <- corpus_subset(corp, docvar1=="München")
c.plot <- dfm(c.plot, tolower = TRUE, remove_numbers = TRUE, remove=stopwords("SMART"))
textplot_wordcloud(c.plot, min.freq = 50, color = col)

c.plot <- corpus_subset(corp, docvar1=="Köln")
c.plot <- dfm(c.plot, tolower = TRUE, remove_numbers = TRUE, remove=stopwords("SMART"))
textplot_wordcloud(c.plot, min.freq = 50, color = col)

c.plot <- corpus_subset(corp, docvar1=="Frankfurt")
c.plot <- dfm(c.plot, tolower = TRUE, remove_numbers = TRUE, remove=stopwords("SMART"))
textplot_wordcloud(c.plot, min.freq = 50, color = col)

c.plot <- corpus_subset(corp, docvar1=="Stuttgart")
c.plot <- dfm(c.plot, tolower = TRUE, remove_numbers = TRUE, remove=stopwords("SMART"))
textplot_wordcloud(c.plot, min.freq = 50, color = col)

c.plot <- corpus_subset(corp, docvar1=="Dresden")
c.plot <- dfm(c.plot, tolower = TRUE, remove_numbers = TRUE, remove=stopwords("SMART"))
textplot_wordcloud(c.plot, min.freq = 50, color = col)